Large-Scale Text Collection for Unwritten Languages

Authors

Florian R. Hanke

Steven Bird

Abstract

Existing methods for collecting texts from endangered languages are not creating the quantity of data that is needed for corpus studies and natural language processing tasks. This is because the process of transcribing and translating from audio recordings is too onerous. A more effective method, we argue, is to involve local speakers in the field location, using an audio-only translation interface that is portable and easy to use. We present encouraging early results of an experimental investigation of the efficiency of creating translations using this method, and report on the quality of the resulting content.

Similar resources

Collecting Bilingual Audio in Remote Indigenous Communities

Most of the world’s languages are under-resourced, and most under-resourced languages lack a writing system and literary tradition. As these languages fall out of use, we lose important sources of data that contribute to our understanding of human language. The first, urgent step is to collect and orally translate a large quantity of spoken language. This can be digitally archived and later tra...

Acquisition of Translation Lexicons for Historically Unwritten Languages via Bridging Loanwords

With the advent of informal electronic communications such as social media, colloquial languages that were historically unwritten are being written for the first time in heavily code-switched environments. We present a method for inducing portions of translation lexicons through the use of expert knowledge in these settings where there are approximately zero resources available other than a lan...

Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages

The Leipzig Corpora Collection offers free online access to 136 monolingual dictionaries enriched with statistical information. In this paper we describe current advances of the project in collecting and processing text data automatically for a large number of languages. Our main interest lies in languages of “low density”, where only few text data exists online. The aim of this approach is to ...

Resource Lean and Portable Automatic Text Summarization

Today, with digitally stored information available in abundance, even for many minor languages, this information must by some means be filtered and extracted in order to avoid drowning in it. Automatic summarization is one such technique, where a computer summarizes a longer text to a shorter non-redundant form. Apart from the major languages of the world there are a lot of languages for which...

Normalising Audio Transcriptions for Unwritten Languages

The task of documenting the world’s languages is a mainstream activity in linguistics which is yet to spill over into computational linguistics. We propose a new task of transcription normalisation as an algorithmic method for speeding up the process of transcribing audio sources, leading to text collections of usable quality. We report on the application of sentence and word alignment algorith...

Publication date: 2013
